ABSTRACT

Three pyramidal adaptive tests and a conventional peaked
test were constructed and administered by time-shared computer
to two separate groups of students enrolled in undergraduate
psychology courses. Six different methods of scoring pyramidal
tests were evaluated with respect to score distributions,
stability, and the degree of relationship among scoring methods
and between pyramidal scoring methods and scores on the
conventional test. For both the pyramidal tests and the
conventional test, score distributions were platykurtic and
positively skewed. Two methods of scoring the pyramidal tests
consistently [...] an equal or greater proportion of the range
of possible scores than the conventional test. The 15-stage
pyramidal tests showed test-retest correlations which were only
slightly lower than those of the 40-item conventional test.
However, when the effects of memory were considered, the
pyramidal strategy yielded more stable ability estimates than
conventional tests of equivalent length. The correlation
between pyramidal test scores and those on conventional tests
ranged from .82 to .86. One pair of scoring methods was found
to be perfectly correlated for properly constructed pyramidal
tests; a second pair correlated almost perfectly. (Author/SE)
ADMINISTERED PYRAMIDAL ABILITY TESTING

Conventional tests of ability have traditionally been
administered by paper and pencil to large groups of
individuals. Each subject is expected to attempt every item
in the test regardless of its difficulty or his/her ability.
Administration of ability test items by interactive computer
systems has made possible the tailoring of tests to the
ability of the individual testee. When an ability test is
administered by computer, items are selected for presentation
according to a pre-determined set of rules or "strategy"
which takes into account the testee's responses to
previously administered items. Adaptive testing strategies are
differentiated by the set of rules used to determine item
selection (Weiss, 1974). The rationale for adaptive testing
is that, by eliminating those items which are either too
difficult or too easy for the person taking a test, its
reliability and validity may be improved and testing time
shortened. Weiss and Betz (1973) have described the various
strategies used and have summarized the research literature
on adaptive testing.

The strategy most frequently used in adaptive testing
has been called "branched", "sequential", or "pyramidal"
testing. This method requires that items be arranged in
a triangular structure according to difficulty. Figure 1
illustrates a pyramidal item structure. Typically, the
first item administered (item 1, stage 1) is of median
difficulty for the group taking the test, and is represented
at the top of the pyramidal structure. The second item
presented (stage 2) is contingent upon whether the response
to the first item was correct or incorrect. If the testee
answers the first item correctly, an item of greater
difficulty (item 3) is administered next. An incorrect response
to item 1 results in the administration of a second-stage
item of lesser difficulty (item 2). Thus, as Figure 1
shows, there are two items at the second level or "stage"
of the pyramid. The testee is routed to an item at stage
3 according to his response to the stage 2 item; again a
more difficult item follows a correct response, and an
easier item follows an incorrect response. The branching
procedure is repeated until the subject has attempted one
item at each of a fixed number of stages. The solid lines
connecting item numbers in Figure 1 illustrate the paths
of two hypothetical testees through the pyramidal structure.
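
The up-one/down-one routing just described can be sketched in a few lines of Python. This is only an illustration, not the administration program used in any of the studies cited: the function names are hypothetical, and the item numbering beyond stage 2 assumes that Figure 1 continues to number items from easiest to hardest within each stage, as it does for items 2 and 3.

```python
def route_through_pyramid(responses):
    """Return the (stage, position) of each item administered.

    Position 1 is the easiest item at a stage, and stage s holds s items.
    A correct response (True) routes to the harder neighbour at the next
    stage; an incorrect response (False) routes to the easier neighbour.
    """
    path = []
    position = 1  # stage 1 holds a single item of median difficulty
    for stage, correct in enumerate(responses, start=1):
        path.append((stage, position))
        if correct:
            position += 1  # harder item; otherwise the position is kept
    return path


def item_number(stage, position):
    """Assumed numbering: items counted stage by stage, easiest first."""
    return stage * (stage - 1) // 2 + position


path = route_through_pyramid([True, False, False, True])
item_numbers = [item_number(stage, position) for stage, position in path]
print(path)          # [(1, 1), (2, 2), (3, 2), (4, 2)]
print(item_numbers)  # [1, 3, 5, 8]
```

Under this numbering a correct answer to item 1 leads to item 3 and an incorrect answer to item 2, which matches the description of stage 2 above.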

The number of items attempted by a testee is equal to
the number of stages (provided that one item is administered
at each stage), and is only a fraction of the total number
of items needed to construct the pyramidal structure. In
the pyramidal structure shown in Figure 1, each testee would
encounter only 10 of the 55 items available for administration.
Many variations of this method of testing have been
suggested (Weiss, 1974). For example, the number of items
to be administered at one stage may be set at three or five;
in such cases, branching is based on the number of items
answered correctly at a given stage. Instead of routing
from item to item, the testee is branched from one block of
items to another with all items in a block having about the
same difficulty.
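
The relation between items attempted and items required can be checked with a small calculation. The sketch below assumes a complete pyramid with one item per node, so that stage s contributes s items; the function name is illustrative only.

```python
def items_in_pyramid(n_stages: int) -> int:
    """Total items in a complete pyramid: 1 + 2 + ... + n_stages."""
    return n_stages * (n_stages + 1) // 2


for n_stages in (10, 15):
    total = items_in_pyramid(n_stages)
    # Each testee attempts one item per stage, i.e. n_stages items.
    print(n_stages, total, round(n_stages / total, 3))
```

For the 10-stage structure this gives 55 items, of which a testee attempts 10.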

The increment or decrement in the difficulty of items at
one stage to those in the next (shown in Figure 1) is called
the "step size" and may be either fixed or variable. Some
pyramidal tests (Paterson, 1962; Lord, 1971a) have used a
large step size at the beginning of the test to make
relatively coarse distinctions among ability levels; as testing
proceeds the step size becomes smaller or "shrinks", enabling
finer and finer discriminations among testees. In most cases,
the increment in difficulty for a correct response is equal
to the decrement in difficulty following an incorrect
response. This insures symmetric branching throughout testing,
and requires that one item at each stage be attempted. This
has been called an "up-one [stage]/down-one [stage]" strategy
or "equal offset".

The term "unequal offset" has been used to explain
branching which is asymmetric (Lord, 1970). In such a case,
following a correct response a testee is routed to a more difficult
item in the next stage, but after an incorrect response,
routing occurs to a much easier item two or even three stages
further into the pyramid (i.e., one or more stages is skipped).
This is known as an "up-one/down-two (or -three)" strategy
and is most commonly used as a correction for guessing. In
this variation the number of items administered is less than
the number of stages, unless the testee responds correctly
to all items administered.
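
A short sketch shows why unequal offset shortens the test. It assumes only what is stated above: a correct response advances the testee one stage, and an incorrect response advances the testee two (or three) stages. The function and parameter names are hypothetical.

```python
def items_administered(responses, n_stages, down=2):
    """Count the items a testee attempts under up-one/down-`down` routing."""
    stage = 1
    administered = 0
    for correct in responses:
        if stage > n_stages:
            break  # the testee has passed the last stage of the pyramid
        administered += 1
        stage += 1 if correct else down  # an error skips `down - 1` stages
    return administered


print(items_administered([True] * 10, n_stages=10))         # 10
print(items_administered([True, False] * 5, n_stages=10))   # 7
```

Only the all-correct response pattern uses every stage; each error removes at least one item from the test.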

Pyramidal tests may be scored by a number of different
methods. First, the rank of the difficulty of the final
item attempted can be considered the individual's score
(Bayroff, Thomas & Anderson, 1960; Seeley, Morton & Anderson,
1962; Waters & Bayroff, 1971). The pyramidal test illustrated
in Figure 1 would, therefore, yield 10 scores. The number
of ranks may be doubled by assigning a higher rank to a
subject answering the final item correctly than to one
who does not (Waters, 1964; Bayroff & Seeley, 1967). The
difficulty level of the final item reached (e.g., Bayroff,
1969) may also be considered an estimate of a testee's
ability (e.g., -1.[illegible] and +1.0 for the two testees
shown in Figure 1). Another method, which takes into account
the correctness or incorrectness of the response to the final
item, involves branching the subject to an hypothetical
item following the last item administered and estimating
its difficulty. This has been named the "final node score"
(Hansen, 1969) or "final difficulty score" (Lord, 1971b).
To distinguish this method from the one utilizing the
difficulty of the last item, it can be called the "n + 1th
item" scoring method. Another scoring method involves the
average difficulty of all items attempted or all items correctly
answered. Lord (1970) has used a related averaging method
which eliminates the first item (since everyone attempts it)
but includes the n + 1th item. He considers it the "score
of choice" (Lord, 1971b, p. 709) for most up-one/down-one
strategies. Finally, a more complicated scoring system has
been proposed by Hansen (1969) which assigns an estimated
score to each item in the pyramid.
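
Several of these scoring methods can be illustrated for a single response record. The sketch below is a minimal illustration under stated assumptions, not the scoring routine of any cited author: it assumes a fixed step size, takes item difficulties as a Python list, and treats the hypothetical n + 1th item as lying one step above the final item after a correct response and one step below it after an incorrect response. All names and the example values are hypothetical.

```python
from statistics import mean


def score_pyramid(difficulties, responses, step):
    """Score one testee's record by several difficulty-based methods."""
    last = difficulties[-1]
    # Hypothetical item the testee would have been routed to next.
    n_plus_1 = last + step if responses[-1] else last - step
    correct = [d for d, r in zip(difficulties, responses) if r]
    return {
        "final_item_difficulty": last,
        "n_plus_1_item_difficulty": n_plus_1,
        "mean_difficulty_attempted": mean(difficulties),
        "mean_difficulty_correct": mean(correct) if correct else None,
        # Averaging method that drops item 1 and adds the n + 1th item.
        "mean_items_2_to_n_plus_1": mean(difficulties[1:] + [n_plus_1]),
    }


scores = score_pyramid(
    difficulties=[0.0, 0.5, 0.0, -0.5],
    responses=[True, False, False, True],
    step=0.5,
)
print(scores)
```

In this made-up record the final item has difficulty -0.5 and was answered correctly, so the n + 1th item score is 0.0, one step above the final item difficulty score.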

Empirical studies. Early research with pyramidal tests
used paper and pencil administration. Krathwohl and Huyser
(1956) administered an eight-stage (one item per stage)
and a four-stage (two items per stage) pyramid to 100 college
students. They obtained correlations of .78 and .68 between
the pyramidal tests and 60-item parent tests. Their
pyramidal tests were completed more quickly than the conventional
tests, and provided almost as much information.

Bayroff, Thomas and Anderson (1960), following Krathwohl's
approach, constructed four six-stage pyramidal tests using a
decreasing step size. Based on their response choice on the
first item, testees were routed to one of three alternative
items at stage 2. Those who selected the correct alternative
were administered a more difficult item; those who
responded with either of two plausible distractors were routed
to an item of the same difficulty as the initial item; and
those who chose the least popular incorrect response were
given an easier item. For the remaining stages, ordinary
up-one/down-one branching was used. Seeley, Morton and
Anderson (1962) administered these six-stage pyramidal
tests to 327 men and correlated scores on the pyramidal
tests with those obtained on corresponding subtests of a
longer conventional test. For both verbal and numeric items,
the correlation between the pyramidal and conventional tests
was .63; however, the distribution of pyramidal scores was
highly skewed, with a large number of scores at the high end
of the distribution. These authors also reported that a
number of the low ability testees did not follow the routing
instructions, resulting in unusable test records for those
examinees.



Wood (1969) administered paper and pencil pyramidal tests
of [illegible] stages to [illegible] students. Step size was
fixed at p = [illegible]; the initial item was of median difficulty
(p = .50); an up-one/down-one branching rule was used;
and the score was the number of items correctly answered
in each test. Validity of the tests was determined by
correlations of test scores with course grades in comparison
with those obtained with a [illegible]-item conventional test.
Correlations between the pyramidal scores and course grades
were all below [illegible]; combining scores on the three pyramidal
tests increased the correlation to .51. The correlation
between the conventional test and grades was [illegible], and a test
composed of the fifteen most discriminating items in the
conventional test had a correlation of .52 with course grades.
Wood concluded that a conventional test is just as good as
a combination of pyramidal tests composed of the same number
of items.

More recent empirical studies have used computers to
administer adaptive tests. Bayroff and Seeley (1967)
administered two eight-stage pyramidal tests by teletype to
[illegible] men. The step size used was p = .03 and final item
difficulties ranged from p = .95 to p = .120; scores were
based on the correctness or incorrectness of the final item,
providing a score range of 17 points. Testees also completed
40-item numerical and 50-item verbal conventional tests.
Correlations between the adaptive and conventional tests
were .83 and .79 (corrected for restriction of range),
compared to an estimated correlation between eight-item
conventional tests and the 40- and 50-item conventional tests
of .75 and .67. Thus, pyramidal tests proved to be more
highly related to the long conventional tests than were
conventional tests of comparable length. By use of the
Spearman-Brown formula, it was found that conventional tests
would require at least twice as many items as the pyramidal
tests to achieve the same correlation with the criterion
paper and pencil tests.

Hansen (1969) administered five different pyramidal
tests by teletype to college freshmen. The number of
stages per test was either three or four, with each student
answering a total of 17 items. Hansen used a step size
of p = .10 and scored his tests by four different methods.
Scores on the pyramidal tests were correlated with scores
on a one-hour classroom exam on the same material completed
one week before the pyramidal tests were administered, and
with scores on another achievement test and final course
grade. The conventional test, even when equated for length,
was found to have a lower internal consistency reliability
than any of the five pyramidal tests. Scores for the pyramidal
tests were distributed more rectangularly than those of
the conventional test, which had a negatively skewed
distribution. Results also showed that the pyramidal tests
were completed in an average of five minutes less time
than the conventional test. Pyramidal tests scored by
two methods also showed higher correlations than the
conventional test with final grade and the achievement test
criterion. A second study produced similar results.

Bryson (1971) compared two five-stage pyramidal tests
with two five-item conventional tests on their correlation
with 100-item parent tests. Conventional tests were
administered by paper and pencil while the pyramidal tests
were administered using a cathode ray computer terminal.
In one of the pyramidal tests, the item selection
procedure sequentially selected items based on the most
discriminating item for all those who reach a given point in
the pyramidal structure, while the other used an item
selection procedure designed to maximize the prediction of total
score (Wolfe, 1970). Both pyramids had a variable step size.
Each pyramidal strategy was administered to two groups of
subjects and the conventional tests were administered
to comparable groups of 250 individuals. Results indicated
that one of the short conventional tests was more highly
correlated with total test score than either of the pyramidal
tests. One of the pyramids had lower correlations with
total test score than either of the conventional tests.

Simulation studies. Simulation involves scoring a
conventional test "as if" it had been administered adaptively
(real data simulation) or using computers to generate
hypothetical subjects, items, and/or test response records
(computer simulation). Bryson's (1971) investigation
compared her empirical findings with those of a real data
simulation using the same four pyramidal and conventional
tests with two groups of 100 subjects. The highest
correlations with total test score were obtained when one of the
two pyramids was used. The other pyramidal strategy had
correlations less than or equal to one of the conventional
tests and higher correlations than the other. These findings
were more favorable to adaptive testing than her empirical
results.

Linn, Rock and Cleary (1969) investigated seven different
branching strategies using real data simulation based on
the responses of 4,88[illegible] students to a 140-item conventional
test. For each strategy, the appropriate items from the
longer tests were selected and scored as if the testees had
attempted only those items in the order required by the
adaptive test. Five of the simulated branching strategies
were two-stage procedures (Betz & Weiss, 1973); the two
remaining designs were pyramidal. The first was a ten-stage
pyramid with a step size of about p = .0[illegible]. The second
pyramidal test consisted of five stages with five items
per stage; thus, 25 items were attempted by each subject,
with branching based on a subject's performance within each
block. Both pyramids used an equal offset. Pyramidal tests
were compared to five shortened conventional tests of from
10 to 50 items. Results showed that the 10-stage pyramidal
test correlated .87 with total test score; the 25-item
pyramid correlated [illegible] and the short conventional tests correlated
.89 to .9[illegible]. The 25-item pyramid correlation with total test
score corresponded to that of a 35-item conventional test.
Linn et al. (1969) also obtained scores on two achievement
tests for the same subjects, which were used as criterion
measures. The 10-item pyramidal test showed a higher
correlation with the criterion measures than the conventional test
of the same length. Similarly, the five-stage 25-item pyramid
correlated higher with the criterion tests than the 30-item
conventional test. These findings imply that pyramidal
testing can result in gains in validity with fewer items
administered in comparison to conventional testing.

Paterson (1962) conducted a Monte Carlo computer
simulation study using a pyramidal strategy. Items in the
pyramid were first structured by difficulty and then ordered by
discrimination. The first items administered were the most
discriminating while the later items were less discriminating
within each level of difficulty. Step size varied as a
function of item discrimination. If a highly discriminating item
was answered correctly, the increment in difficulty between
that item and the next was large. When an item of low
discrimination was answered correctly, the increment in
difficulty was small. Similarly, decrements in step sizes
depended on the discriminations of items which were answered
incorrectly. Since items were arranged according to
discrimination, the step sizes at the beginning of the test
were large and decreased as the testee moved through the
pyramidal structure.

Paterson's pyramid consisted of six stages and was
compared with a six-item conventional test for an hypothetical
population of 1,500 individuals, with 100 people at each of
15 ability levels. The two testing strategies were compared
at five levels of item discrimination under conditions of
normal, rectangular, and U-shaped distributions of ability.
The effects of errors in estimating the item parameters
were studied by including items of inappropriate difficulty
or discrimination in the pyramidal tests. The data led to
the conclusion that errors in parameter estimates in pyramidal
testing did not seriously affect the score distributions
obtained. Pyramidal testing was found to give better
estimates of ability than conventional tests when U-shaped or
rectangular distributions of ability were assumed. Pyramidal
test scores were also more precise than conventional scores,
especially at the extremes of the ability distribution, and
could predict ability from test scores as well as
conventional tests.

Theoretical studies. Waters (1964) conducted a
theoretical comparison of a five-stage pyramidal test and four
conventional five-item tests using Lord's (1952) model to
obtain the correlation between test score and underlying
ability for each test. The hypothetical pyramidal test
used a step size of p = .10, an up-one/down-one branching
rule, and was scored by two methods. Under either scoring
method, the correlation between test score and ability was
higher for the pyramidal test than for any of the
conventional tests, whether free-response or multiple-choice
format was used. The pyramidal test produced a more
rectangular score distribution and a potentially greater dispersion
of scores than the conventional tests.

Waters and Bayroff (1971; Waters, 1970) compared 3-,
10-, and 15-stage pyramids and a ten-stage pyramid with
two items per stage to conventional tests of the same length.
Both conventional tests and pyramidal tests differed in the
variability of item difficulties, and item discriminations
were systematically varied. The distribution of ability was
assumed to be normal. Results showed that correlations of test
score and ability were related to both the distribution of
item difficulties and item discrimination, that correlations
for the pyramidal tests were higher than those for the
conventional tests, particularly with highly discriminating
items, and that the one-item-per-stage pyramids showed higher
correlations of test scores and ability than the
two-item-per-stage pyramids.

Lord has reported several theoretical studies on
pyramidal testing (Weiss & Betz, 1973). His analyses, based on
the mathematics of item characteristic curve theory and the
theory of Markov chains, compared 10-, 15-, and 60-stage
pyramids with conventional tests of 60 items (Lord, 1970,
1971a, b; Stocking, 1969). Step sizes were systematically
varied across tests but remained constant for any given test.
Branching rules studied were up-one/down-one, up-one/down-two,
up-one/down-three, and up-two/down-three, under a variety of
scoring methods. Results showed that for conventional tests
the information function was bell-shaped, leptokurtic, and
symmetric about the median ability level; ability was most
accurately estimated from test scores for those subjects at
or near the median ability. Pyramidal information functions
were platykurtic, in some cases approximating a straight
line, indicating that precision of test scores was more
nearly equal across ability levels. At the median ability
level, the 60-item conventional test provided more precise
measurement than any pyramidal test. However, for abilities
beyond ±.5 to ±1.0 standard deviations the pyramidal tests
provided more precise measurement. Different methods of
scoring the pyramid provided different results, as did
different stepping rules. Lord (1970, 1971a) also
investigated a variable step size procedure adapted from bio-assay
work called the Robbins-Monro procedure. In this strategy
large increments or decrements in item difficulty occur early
in the testing process, with progressively smaller step sizes
occurring later in testing. The procedure is designed to
converge on a difficulty level at which each individual has
a .50 probability of answering each item correctly. Although
this procedure yielded extremely favorable results for
pyramidal tests, it requires item pools that are so large as
to be practically unfeasible.
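
The shrinking-step idea can be sketched as follows. This is an illustration only, not Lord's procedure as published: it assumes that the step at stage k is the first step divided by k (a harmonic schedule of the kind used in Robbins-Monro stochastic approximation), and the function name, starting difficulty, and first step are hypothetical.

```python
def shrinking_step_difficulties(responses, start=0.0, first_step=1.0):
    """Difficulty of each successive item under a step that shrinks as 1/k."""
    difficulties = [start]
    for k, correct in enumerate(responses, start=1):
        step = first_step / k  # assumed schedule: large early, small later
        change = step if correct else -step
        difficulties.append(difficulties[-1] + change)
    return difficulties


trace = shrinking_step_difficulties([True, False, True])
print([round(b, 3) for b in trace])  # [0.0, 1.0, 0.5, 0.833]
```

Because the step differs at every stage, different response patterns seldom lead back to the same difficulty value, which is one way to see why such a procedure calls for a much larger item pool than a fixed-step pyramid.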

Mussio (1972) has attempted to reduce the large number
of items required in pyramidal testing by adopting
"reflecting barrier" and "retaining barrier" strategies. Both
modifications involve truncating the upper and lower tails of
the pyramidal structure, thus eliminating many items at
extreme difficulty levels. Like Lord, Mussio presented his
theoretical results in the form of information curves and
obtained similar results. Pyramidal tests modified by either
"barrier" provide less information at the mean of an ability
distribution than a conventional peaked test, but much more
information for those individuals whose ability deviates
from the mean. The retaining barrier was found to provide
more nearly equal estimates of precision over the range of
abilities than the reflecting barrier. Although both approaches
showed some loss in precision at very extreme ability levels,
each was still more precise than conventional tests at those
ability levels.

Summary. The research available on pyramidal testing
has used a wide variety of subjects, item pools, and test
characteristics including variations in branching strategies,
entry points, step sizes, offsets, and scoring methods.
Administration of considerably fewer items has resulted in
shorter testing times when complex instructions and paper
and pencil formats have been eliminated. Several pyramidal
tests have shown higher correlations with parent tests than
conventional tests of the same length. Pyramidal tests
designed by Hansen (1969) and Linn et al. (1969) have obtained
higher correlations with outside criteria than conventional
tests. Pyramidal tests have also been shown to produce a
more rectangular equidiscriminating score distribution than
conventional tests (Hansen, 1969), and have higher
correlations with underlying ability (Waters & Bayroff, 1971) when
the items are highly discriminating. Theoretical studies
have also shown that pyramidal tests have nearly constant
precision of measurement across all levels of ability. This
level of precision is much greater than that for conventional
tests at the more extreme ability levels (Lord, 1970, 1971a,
b; Mussio, 1972; Paterson, 1962).

Much of the empirical and simulation research has attempted
to determine how highly pyramidal tests correlate with longer
conventional parent tests. Investigators have been concerned
with constructing short adaptive tests which yield essentially
the same information as a conventional test. The theoretical
studies have demonstrated that, for many people, pyramidal
tests may be more accurate measurement instruments than
conventional tests. If this is the case, then the
demonstration of a strong relationship between the two testing
strategies is not of primary importance. One major purpose of
adaptive testing is to obtain measures of ability which are
more precise than those of conventional tests. When this is
considered, a high adaptive-conventional correlation is neither
necessary nor desirable.

None of the studies to date has attempted to assess the
relative test-retest stabilities of pyramidal and
conventional tests. Furthermore, only Hansen (1969) has studied
the relationships between the various pyramidal scoring
methods. The present investigation was designed to
supplement the existing literature on pyramidal tests in these
areas, and to replicate some of the findings of earlier
studies using longer pyramids than had been used in previous
empirical studies.



Method 

The pyramidal tests used in this study represent only
one of several strategies of adaptive testing being used
in a larger series of research studies (e.g., Betz & Weiss,
1973). This series of studies is designed to investigate
the possible advantages of adaptive testing strategies as
compared to conventional ability testing procedures, and to
determine which adaptive approaches provide the most accurate
measurement of ability. Adaptive tests are being compared
to conventional tests and to other adaptive strategies with
respect to ability estimation, stability, internal
consistency reliabilities, and other psychometric characteristics.
At the same time, the research is concerned with answering
basic questions about each adaptive strategy. These include
such questions as optimum ways of structuring the branching
paradigm, problems in determining branching rules, and
determination of useful and reliable methods of scoring the adaptive
tests.
All adaptive and conventional tests were administered
by computer (DeWitt & Weiss, 1974). Testing strategies
were administered two at a time so that scores on one
adaptive test could be compared with those on another, and so
that adaptive and conventional tests could be compared.
Each individual was tested on two occasions, with a period
of about seven weeks between the initial and final testings,
in order to compare the test-retest stabilities of each
testing strategy, and of scoring methods within a strategy.

Test Development 

Item Pool. The item pool consisted of 369 five-alternative
multiple-choice vocabulary questions (see McBride &
Weiss, 1974 for details of item development and norming).
Each item had been normed on groups of college undergraduates.
Norming resulted in estimates of item difficulty
(proportion correct), and item discrimination indicated by the
biserial correlation of each item with total score on the
norming tests. Approximations to the normal ogive item
parameters "a" and "b" were determined by the following
formulas (Lord & Novick, 1968, pp. 376-378).




$$a = \frac{r_b}{\sqrt{1 - r_b^{2}}} \qquad (1)$$

$$b = \frac{-\Phi^{-1}(p)}{r_b} \qquad (2)$$

where $a$ is the normal ogive index for
discrimination;

$b$ is the normal ogive index for difficulty;

$r_b$ is the biserial correlation coefficient
between item response and total score;

$\Phi^{-1}(p)$ is the inverse of the cumulative normal
distribution corresponding to the
proportion correct.
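
As a worked illustration of this conversion, the sketch below computes approximate "a" and "b" values from a proportion correct and a biserial correlation. It assumes the standard normal ogive approximations, a = r / sqrt(1 - r^2) and b = -z(p) / r, where z(p) is the inverse cumulative normal value for the proportion correct; the negative sign is the convention under which easier items receive lower b values and is an assumption here. The function name and the example values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def normal_ogive_parameters(p, r_bis):
    """Approximate normal ogive a (discrimination) and b (difficulty).

    p     -- proportion of the norming group answering correctly (0 < p < 1)
    r_bis -- biserial correlation of the item with total score (0 < r_bis < 1)
    """
    a = r_bis / sqrt(1.0 - r_bis ** 2)
    # Assumed sign convention: items with p > .50 get negative difficulty.
    b = -NormalDist().inv_cdf(p) / r_bis
    return a, b


a_mid, b_mid = normal_ogive_parameters(p=0.50, r_bis=0.60)
a_easy, b_easy = normal_ogive_parameters(p=0.80, r_bis=0.50)
print(round(a_mid, 2), round(b_mid, 2))
print(round(a_easy, 2), round(b_easy, 2))
```

An item answered correctly by half the norming group has a difficulty of zero on this scale, and an item answered correctly by 80 percent has a negative difficulty.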

The item pool was not composed of an equal number of items
at each level of difficulty; rather, there were many highly
discriminating items which were relatively easy, and fewer
highly discriminating items which were difficult.

Construction of the pyramidal tests. Three different
pyramidal tests were used in this study. All were 15-stage
fixed branching models with a constant step size. All used
an up-one/down-one branching rule (Weiss & Betz, 1973; Weiss, 

For pyramids 1 and 2, the following rationale was used
in test construction. Each test was to be administered
with a conventional test; therefore, those items used in
the conventional test were excluded from the pyramidal
tests, in order to avoid a deceptively high correlation
between scores from the two testing strategies. This
resulted in an important constraint in the construction of
the pyramidal tests. Since the conventional test was peaked
at b = 0, many highly discriminating items of moderate
difficulty were unavailable for the pyramid. However, the
pyramidal structure, as illustrated in Figure 1, shows that most
items required by this strategy fall into the range of moderate
difficulty, with fewer items required at extreme levels of
difficulty. In general, n(n+1)/2 items are required for an
n-stage pyramid. Thus, 15(15+1)/2 or 120 items were needed
to build a complete 15-stage pyramidal structure. In order
to construct a symmetric pyramid of 15 stages having an
initial item of median difficulty and terminal items which
ranged in value from -3.0 to +3.0 standard deviations, a
step size of b = 0.2 was necessary. That is, increases or
decreases in item difficulty from one stage to the next were
fixed at a normal ogive difficulty value of 0.2.

Appendix A shows the item difficulty and discrimination 
structure of the three pyramids used in this study. Tables 
A-1 and A-2 indicate that the initial item presented to all 
testees in pyramids 1 and 2 had a difficulty of b =-.05. 
A correct response branched the subject to a more difficult 
item at stage 2 (b = .21), while an incorrect response 
branched him to an item easier than the first (b = -.13). 
This process was repeated until each subject had attempted 
15 items. Once the difficulty of the initial item and the
step size had been determined, the remaining items in the
pool were divided into 29 groups, with all items in a group
having about the same "b" value and an "a" value of .30 or
higher. These groups correspond to the 29 columns of items
in the tables of Appendix A.
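
The structure described above can be sketched in code. This is an
idealized illustration only: it assumes a starting difficulty of exactly
0.0 and a uniform step of 0.2, whereas the actual item difficulties in
Appendix A only approximate such a grid. The names build_pyramid and
administer are hypothetical.

```python
def build_pyramid(n_stages=15, start_b=0.0, step=0.2):
    """Return {(stage, position): difficulty} for an idealized pyramid.

    position = number of correct answers given before reaching the stage,
    so stage k holds k items. start_b and step are assumed values.
    """
    pyramid = {}
    for stage in range(1, n_stages + 1):
        for position in range(stage):
            # net movement: steps up (correct) minus steps down (incorrect)
            column = 2 * position - (stage - 1)
            pyramid[(stage, position)] = round(start_b + step * column, 10)
    return pyramid


def administer(responses):
    """Branching rule: correct -> harder item, incorrect -> easier item."""
    position, path = 0, []
    for stage, correct in enumerate(responses, start=1):
        path.append((stage, position))
        if correct:
            position += 1
    return path


pyramid = build_pyramid()
n_items = len(pyramid)                   # n(n+1)/2 items in total
n_columns = len(set(pyramid.values()))   # distinct difficulty levels
```

Under these assumptions a 15-stage pyramid contains 120 items spread
over 29 difficulty columns, in agreement with the counts given in the
text.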

It has been suggested by Paterson (1962) that within
each column items be ordered according to discrimination,
with the most discriminating item appearing first. In
pyramids 1 and 2, there are several exceptions to this rule,
as shown in Tables A-1 and A-2. For example, in column \^
of Tables A-1 and A-2, the second item is the one with the
highest discrimination. Similarly, in column It) of these
tables the best discriminating item is fourth, not first.
In constructing these two pyramids, in cases in which the
difficulties of items varied widely within a column, item
difficulty was considered more important than item dis-
crimination. Pyramid 3 was structured so that item
discriminations were ordered from highest to lowest within
each column (see Table A-3).

For the first group of subjects, the pyramid 1 test 
was presented with a 40-item conventional test. After the 
initial administration, two errors were found in the pyra-
midal test; items of inappropriate difficulty were located in
difficulty level 12 at stages 4 and 6. Because both items
appeared in early stages of the structure, many testees
(about one-third of the group) attempted one or both of
them. Pyramid 2 was a modified version of the first pyra- 
mid, with the errors corrected. Half the subjects received 
the original pyramid on retesting and the remaining subjects 
completed the modified version in order to see whether errors 
in test construction would significantly affect results. 

Pyramid 3 (Appendix Table A-3) was administered to a 
separate group of testees several months after the first 
two pyramids had been administered. This pyramid was to 
be given with other adaptive tests which used large numbers 
of items from the vocabulary pool. Thus, no attempt was 
made to exclude any items from the pyramid. Since a greater 
number of highly discriminating items of median difficulty 
were available, and since items were ordered within a column
solely on the basis of their discriminations, the average 
item discrimination for this test was higher than that of 
pyramids 1 and 2. 

Table 1 presents means and standard deviations for the
difficulties, discriminations, and step sizes of the three
pyramidal tests. As Table 1 shows, the three pyramids are
essentially equivalent with respect to mean difficulties
of the items (although pyramid 3 is slightly easier than the
other two), mean item discriminations (although pyramid 3
has items of slightly higher discriminations), variability
of both item difficulties and discriminations, and average
step size. Pyramid 1 has considerably larger variability
of step size than do pyramids 2 or 3, due solely to the
effect of the two items of inappropriate difficulty present
in pyramid 1.

Construction of the conventional test. The conventional
test used in the study was a peaked test composed of 40 items.
Items with p-values of about .60 and high biserial correla-
tions were selected from the item pool. Appendix Table A-4
presents the normal ogive difficulty and discrimination
parameters for each item in the conventional test. Table 1
shows means and standard deviations of these normal ogive
parameters for both the difficulty and discrimination of the
conventional test. As Table 1 indicates, the mean difficulty
of the conventional test (-.188) was lower than that of any
of the pyramids. The conventional test was constructed to
adjust its average difficulty for guessing (Betz & Weiss,
1973, p. 13). On the other hand, the mean difficulty of the
pyramids was set at the mean difficulty of the group being 
measured. The pyramid was not adjusted for guessing since 
it was assumed that, as a result of the adaptive test's 
capacity to adjust difficulty level to the individual's 
ability, guessing was less likely to occur (Weiss & Betz, 
1973). Since the conventional test was a "peaked" test,
the standard deviation of its difficulties was considerably
less than that of the pyramidal tests, which were con-
structed to measure along an ability continuum.

Table 1 also shows that the adaptive tests were composed 
of more discriminating items than the conventional test. The 
latter test was constructed to approximate the conventional 
tests used in Lord's (1970, 1971a, b) studies (see Weiss &
Betz, 1973). It has been suggested, however, that adaptive 
tests require more highly discriminating items to be effec- 
tive (e.g., Urry, 1970). Thus, the pyramidal tests used 
the most discriminating items available in the item pool, 
within the limitations of the difficulty structure required. 
This latter fact accounts for the larger variability of dis- 
crimination indices for the pyramidal test as compared to 
the conventional test. 

Scoring the Pyramidal Tests 

Six scoring methods were used to estimate ability in 
order to determine which provided the most accurate and most 
stable estimates. Method 1 is the simple number correct 
score which has been used by Lord (1970, 1971a, b). For a 
15-stage pyramid, sixteen different number correct scores 
are possible (0 to 15). Method 2 involved computing the
mean difficulty of all items attempted for each subject.
Lord (1970, 1971a, b) has suggested a similar approach in
which the first item is omitted and an hypothetical 16th item
is included. Method 3 is analogous to the second; in this
method, the mean of the difficulties of the correctly answered
items was obtained. In method 4, a subject's score was the
difficulty of the final item attempted in the pyramid. Since
one objective of adaptive testing is to administer items
appropriate to the ability level of the testee, the point at
which he/she finishes the test can be considered a good esti-
mate of ability (Lord, 1970). While Bayroff (1960) used the
p-value of the final item reached as the testee's score, the
normal ogive parameters used in the present investigation are
more easily interpretable as an estimate of the subject's
ability level.
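
As an illustration, the first four scoring methods can be computed from
a testee's record of item difficulties and responses. The sketch below
is not the scoring program used in the study; the function name and the
handling of a record with no correct answers (returning None for
method 3) are assumptions.

```python
from statistics import mean


def score_methods_1_to_4(difficulties, responses):
    """Score one testee's pyramidal test record.

    difficulties[i] is the "b" value of the i-th item attempted;
    responses[i] is True if that item was answered correctly.
    """
    correct_b = [b for b, r in zip(difficulties, responses) if r]
    return {
        "number_correct": sum(responses),                # method 1
        "mean_b_attempted": mean(difficulties),          # method 2
        # method 3 is undefined with no correct answers (assumed: None)
        "mean_b_correct": mean(correct_b) if correct_b else None,
        "final_item_b": difficulties[-1],                # method 4
    }


# A short hypothetical record: correct, incorrect, incorrect, correct.
example = score_methods_1_to_4([0.0, 0.2, 0.0, -0.2],
                               [True, False, False, True])
```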

Method 5 employs an hypothetical 16th item. Since method
4 does not take into account the correctness or incorrectness
of the testee's final response, this method branches the
testee to an hypothetical item whose difficulty would be
that of the 16th item, were one to be given. Lord (1970,
1971a, b) has called this the "final difficulty score."
Each (n + 1)th item value was computed by averaging the
difficulties of all items in its column. Values for the
two extreme (n + 1)th items were obtained by using the mean
difference between the remaining fourteen items in the
(n + 1)th stage and adding it (or subtracting it, in the case
of the lower extreme) to the difficulty of the (n + 1)th item
adjacent to it.
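
The extrapolation of the two extreme values can be sketched as follows.
The sketch assumes that "mean difference" refers to the mean gap between
adjacent interior values; the function name and the illustrative
difficulties are hypothetical and are not taken from Appendix A.

```python
def extend_final_stage(interior_values):
    """Add the two extreme hypothetical-stage values.

    interior_values: the 14 interior (n + 1)th-stage difficulties,
    obtained as column means, sorted from easiest to hardest.
    """
    gaps = [hi - lo for lo, hi in zip(interior_values, interior_values[1:])]
    mean_gap = sum(gaps) / len(gaps)   # assumed meaning of "mean difference"
    lowest = interior_values[0] - mean_gap
    highest = interior_values[-1] + mean_gap
    return [lowest] + list(interior_values) + [highest]


# Illustrative interior values spaced 0.4 apart, from -2.6 to +2.6.
stage_16 = extend_final_stage([-2.6 + 0.4 * i for i in range(14)])
```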

Scoring method 6 was the all-item score developed by 
Hansen (1969). In this method, two points are given for
a correct answer. In addition, 2 points are added for
each item in that stage which is easier than the one
attempted, and one point more is added for the next most
difficult item in that stage; all more difficult items are
scored zero. For an incorrect response, 0 points are given 
for the item attempted and for all items of greater diffi- 
culty in the same stage. One point is added for the next 
easier item in the same stage, and 2 points are given for 
all other items of lesser difficulty in the same stage. In 
this way, all-item scores assign a value to all 120 items 
in the pyramid for each subject, even though only 15 items 
were attempted. In contrast to all other scoring methods 
in which only items actually answered by the testee receive 
a score, this procedure may provide a method for assessing 
the internal consistency reliability of pyramidal tests by 
standard reliability formulas. Scores for this method ranged 
from 0 to 240.
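
The all-item rule can be written compactly if the items within a stage
are indexed from easiest to hardest. The following is a minimal sketch
of the rule as described above, not Hansen's own program; the function
names are hypothetical.

```python
def stage_points(stage_size, position, correct):
    """Points credited to a whole stage under the all-item rule.

    Items in the stage are indexed 0..stage_size-1 from easiest to
    hardest; position is the index of the item actually attempted.
    """
    if correct:
        points = 2 * (position + 1)        # attempted item plus all easier items
        if position + 1 < stage_size:
            points += 1                    # next more difficult item
    else:
        points = 0                         # attempted item and all harder items
        if position >= 1:
            points += 1                    # next easier item
            points += 2 * (position - 1)   # all remaining easier items
    return points


def all_item_score(responses):
    """Sum stage points along the path defined by the responses."""
    position, total = 0, 0
    for stage, correct in enumerate(responses, start=1):
        total += stage_points(stage, position, correct)
        if correct:
            position += 1                  # branch to the harder item
    return total
```

Fifteen correct responses yield 240 and fifteen incorrect responses
yield 0, which reproduces the score range stated in the text.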

Test Administration and Subjects 

Both conventional and pyramidal tests were administered 
by cathode-ray terminals (CRTs) acoustically coupled to a
time-shared computer. Items were presented on the CRT screen
and testees indicated their response by typing in the number
of the correct alternative to the multiple-choice item.
Following their response, the next item appeared on the screen.
Since the first item of the second test appeared immediately
after the final item of the first test, subjects were not
aware that two tests were being given (see DeWitt & Weiss,
1974, for details of the computer system controlling test
administration).

Subjects were all undergraduates enrolled in either
general psychology or psychological measurement and
statistics courses at the University of Minnesota. None
had any previous experience with computerized testing.
Instructional screens explaining the operation of the CRTs
were provided prior to testing and a proctor was present
in the testing room to provide further assistance to any
testee having difficulty with the equipment. Testees
were permitted as much time as necessary to complete the
tests and were so informed before the tests were begun.

For the Pyramid 1 study, 250 subjects were originally 
tested with both the pyramidal and conventional tests. One 
hundred twenty-five subjects completed the pyramidal 
test first and the remaining 125 were given the conventional 
test first. Each subject was retested about seven weeks 
later. The mean interval between test and retest was 52.5
days; the standard deviation was 7.5 days, and retest
intervals ranged from 39 to 70 days. At retest, the group
was randomly divided into two subgroups; half the subjects
received a retest of pyramid 1 plus a numeric norming test
(N=101), while the remaining half was administered the
revised pyramid, pyramid 2, and the same conventional test
(N=103). Thus, subgroup 1 yielded test-retest data on
pyramid 1, while subgroup 2 yielded retest data on the
conventional test and an approximation to an alternate form
retest for pyramids 1 and 2.

Pyramid 3 was administered with a stradaptive test 
(Weiss, 1973) to 142 testees. On retest, 138 subjects were
administered the same pyramid and a two-stage test. In
both administrations, the order of test presentation was
randomized. Complete test-retest data on pyramid 3 were
available for 128 subjects. The test-retest interval for
pyramid 3 was also about 7 weeks with a mean of 49.2 days,
a standard deviation of 4.8 days, and a range of 40 to 63
days.

Analysis 

The general outline for the studies using each of the 
pyramidal tests is shown in Table 2. The data to be analyzed
in the Pyramid 1 study consisted of two sets of six pyramidal
scores, one set for the initial test and one for the retest.
Scores for the conventional test (number correct) were available
only for the initial test on this group. Those testees com-
pleting Pyramid 1 at time 1 and Pyramid 2 at time 2 also had
two sets of six scores. Conventional test scores were available
for both test administrations. Thus, for this group the test-
retest stabilities of the pyramidal test could be compared
with that of the conventional test. No conventional test
was administered with Pyramid 3. Subjects completing this
test at initial testing and at retest were scored by the
same six methods used for the other pyramidal tests.

Thus, the design permitted analysis of the stability of
scores on pyramid 1 (group 1), stability of scores on a
pyramidal test with revisions in the item structure (pyramids 1
and 2 on group 2), test-retest stability for the conventional
test (group 2), and stability of a pyramidal test (pyramid
3) constructed differently than the other pyramids (group 3).

Order effects. To determine whether order of admini-
stration significantly affected test scores, the 250 testees
who completed pyramid 1 at time 1 (groups 1 and 2) were
randomly divided into two subgroups. The pyramidal test
was administered first and the conventional second to 125
testees. The order was reversed for the remaining 125. In
this way, fatigue or practice effects or carry-over effects
between strategies could be detected. T-tests were used
to determine whether the differences between the mean scores
for each order were statistically significant for the initial
test administration. Subjects administered Pyramid 3 were
divided into two subgroups on both test and retest. The
first was given the pyramidal test first and a stradaptive
test (Weiss, 1973) second. The order was reversed for the
remaining subjects. Since a different adaptive test (a two-
stage test; Betz & Weiss, 1973) was administered with pyra-
mid 3 during the retest, testees were again divided into two
groups with respect to order of administration, and t-tests
were computed for each scoring method.

Score distributions. Two previous empirical investiga-
tions using pyramidal testing models have found that score
distributions have been negatively skewed, with many testees
obtaining near maximum scores. Seeley, Morton, and Anderson
(1962) reported that such a result could be attributed either
to the scoring method used or the difficulty of the test.
Bayroff and Seeley (1967), using two 8-stage pyramidal tests,
found scores distributed approximately normally for the
verbal section but negatively skewed for both the numerical
section and the conventional test. Hansen (1969), however,
found that for one scoring method, a more rectangular dis- 
tribution of scores was obtained with pyramidal tests than 
with conventional tests. 

One objective, then, of the present study was to inves-
tigate the distributions of scores on the 40-item conven-
tional test and those derived from each pyramidal scoring
method. These analyses were designed to examine (1) the
appropriateness of test difficulty, (2) the relative varia- 
bilities of each of the various scoring procedures, and 
(3) the shape of the obtained score distributions. 

In order to express the variability of the pyramidal
scoring methods in a common unit, the standard deviations
for each scoring method were divided by the range of poten-
tial scores and the results expressed as the "proportion
of range utilized" (Betz & Weiss, 1973). The ranges for
each scoring method were determined in the following manner:
(1) the "number correct" range was simply 15-0 = 15 for all
three pyramids; (2) the "mean difficulty of all items attempted"
range was obtained by subtracting the mean difficulty score
made by a testee answering all items incorrectly from the
score of one responding correctly to all 15 items; (3) the
"mean difficulty of all items correct" range was obtained
by subtracting the lowest possible (n + 1)th score from the
"mean difficulty" score of a testee with 15 correct re-
sponses; (4) the "final item difficulty" range was the
difference between the easiest and most difficult terminal
items; (5) the "(n + 1)th item difficulty" range was the
difference between the two extreme (n + 1)th values; and (6) the
all-item score range was 240 for all three pyramids. Exact
values for these ranges are summarized in Appendix D.
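
The index itself is a simple ratio. The sketch below assumes the sample
(n - 1) standard deviation, which the text does not specify; the
function name and the example scores are hypothetical.

```python
from statistics import stdev


def proportion_of_range(scores, lowest_possible, highest_possible):
    """Standard deviation expressed as a fraction of the potential range."""
    return stdev(scores) / (highest_possible - lowest_possible)


# Hypothetical number-correct scores on a 15-stage pyramid (range 0 to 15).
example_index = proportion_of_range([5, 7, 9, 11], 0, 15)
```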

In addition to the mean and variability indices, the 
skewness and kurtosis of each distribution were computed, 
and the significance and direction of its departure from 
normality was determined (McNemar, 1969, pp. ri5-28, 87-88).

Stability. Previous investigations of pyramidal testing
have usually been concerned with the correlation between a
short branching test and a longer conventional test. None
has studied the relative stabilities or internal consistency
reliabilities of conventional versus pyramidal tests. To
investigate the accuracy of each scoring method, test-retest
correlations were computed for all testees completing both
administrations of the pyramidal and conventional tests. In
order to detect curvilinear relationships in test-retest
stability, eta coefficients were also computed and each
bivariate relationship was tested for curvilinearity (McNemar,
1969, pp. 315-317). These data were expected to yield
initial information on the relative utility of the various 
scoring methods for making longitudinal predictions. 
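
The contrast between the product-moment correlation and eta can be
illustrated with a short sketch. It treats each distinct score on the
first variable as a category, which is an assumption; continuous scores
such as mean difficulties would first have to be grouped into intervals.
The function names and data are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def eta(x, y):
    """Correlation ratio of y on x, each distinct x value forming a group."""
    groups = defaultdict(list)
    for a, b in zip(x, y):
        groups[a].append(b)
    grand = sum(y) / len(y)
    ss_total = sum((b - grand) ** 2 for b in y)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2
                     for g in groups.values())
    return sqrt(ss_between / ss_total)


# Hypothetical data with a purely curvilinear relationship.
x = [1, 1, 2, 2, 3, 3]
y = [1, 2, 4, 5, 2, 1]
r_xy, eta_yx = pearson_r(x, y), eta(x, y)
```

In this example r is zero while eta is large. A gap of this kind between
the two coefficients is what a test for curvilinearity examines.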

To evaluate the effects of the length of the time
interval between test and retest on stability, subjects
completing both tests were divided into three groups. The
first was composed of those testees whose test-retest inter-
val was short (39 to 49 days for pyramids 1 and 2, i <^ Mi
days for pyramid 3); the next group had a moderate test-
retest interval (50 to 58 days for pyramids 1 and 2, 'i? t
days for pyramid 3); and the last had the longest inter-
val (59 to 70 days for pyramids 1 and 2, 5'* t o ()'\ days for
pyramid 3). Test-retest correlations were then calculated
separately for each group. Both the time interval and the
number of subjects were kept approximately equal for each
pyramid and the conventional test.



Memory effects. Stability is affected by memory.
When a conventional test is administered to the same
subjects twice, test-retest correlations may be spuriously
high because subjects may remember how they answered items
the first time and respond in the same way on second test-
ing. For a pyramidal test, however, subjects may be ad-
ministered a different set of items during the retest
if they move through the pyramidal structure through 
pathways different from those taken during the original 
test. Thus, it is possible for subjects completing the 
same pyramidal test twice to obtain the same score both 
times, while repeating considerably fewer items than 
would be the case for a conventional test. For this 
reason, memory effects are likely to be smaller in pyra- 
midal tests, and test-retest correlations may not be as 
inflated by memory effects as those for a conventional 
test of comparable length. 

In order to evaluate the effects of memory, the 40-item
conventional test was divided into two 15-item parallel
subtests. The shortened conventional subtests were com- 
prised of only 15 items to facilitate comparison to the 
15-stage pyramidal tests. The following method was used. 
A bivariate graph was constructed with item difficulty on 
the abscissa and discrimination on the ordinate. The 40
items were plotted, and the fifteen pairs of items whose 
"a" and "b" values most nearly matched were selected. 
Members of each pair were randomly assigned to each of 
the two parallel subtests. Item parameters for the items 
of both parallel subtests are given in Appendix C. As 
Appendix C shows, the two subtests could be considered 
parallel since the means and standard deviations of both 
their difficulty and discrimination parameters were almost 
identical.

Figure 2 indicates diagrammatically the design for the
analysis of memory effects. The degree of similarity between
the two parallel forms of the 15-item conventional subtests
at each of the test administrations is indicated by the two
vertical lines; these are parallel forms reliability coef-
ficients. The horizontal lines represent the test-retest
stability correlations for both 15-item subtests. Because
all 15 items are repeated, this condition allows the maximum
effect for memory. The diagonal lines show the correlations
between different 15-item subtests at different times. If
memory effects were present, these correlations should follow
a specified pattern. First, since subjects attempt the same
items twice, the stability correlations should be the highest
in the analysis. Secondly, these test-retest correlations
would be higher for either subtest than the correlation
between one subtest at time 1 and the other at time 2

Figure 2

Design for the analysis of memory effects in the conventional test

[Diagram: Subtest 1 and Subtest 2 at Time 1 and at Time 2, joined by
vertical lines (parallel forms reliability), horizontal lines
(stability), and diagonal lines (reliability of parallel forms with
time interval).]

since testees would have attempted identical items within
forms and completely different items across forms. The
latter correlations represent a "no memory" condition.
These should be the lowest in the analysis, as memory effects
would not be present and a time interval separates the two
test administrations. Finally, the parallel forms correla-
tions, which involve no repeated items and, therefore, no
memory effects, should fall intermediate between the memory
condition (stability correlations) and the no memory con-
dition (parallel forms with time interval).

On the pyramidal test, most testees could be expected
to attempt an intermediate number of identical items on
test and retest. Therefore, it would be expected that
stability estimates of the pyramidal test would fall between
the extremes of the "no memory plus time interval" and
"maximum memory" conditions for the conventional 15-item
subtests described above, if the stability of the pyramidal
testing strategy did not differ substantially from that of
a conventional test of the same length.

Change Analysis. When a conventional test is administered
to the same subjects more than once, memory and practice
effects may operate to increase retest scores. No investi-
gation has yet attempted to find similar effects in adap-
tive testing. In order to determine whether scores on the
conventional and pyramidal tests changed significantly from
one testing to the next, correlated t-ratios were computed
contrasting mean scores for the initial and retest administra-
tions. These analyses were conducted for each method of
scoring the pyramidal tests and for each pyramid, to determine
whether scoring methods and/or the structure of the pyramid
had differential effects on mean score changes.
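
A correlated t-ratio of this kind is the mean of the paired differences
divided by its standard error. The sketch below is one standard way of
computing it and is not taken from the study's own programs; the
function name and scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def correlated_t(initial, retest):
    """Return (t, degrees of freedom) for paired initial and retest scores."""
    diffs = [b - a for a, b in zip(initial, retest)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Hypothetical scores for four testees on the two occasions.
t_ratio, df = correlated_t([10, 12, 9, 11], [12, 13, 11, 12])
```

A positive t indicates that mean scores rose from the initial test to
the retest.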

Internal Consistency Reliability. Measures of the inter-
nal consistency reliability of both the conventional and
pyramidal tests were obtained by the Hoyt (1941) method.
In order to compute such an index, a score for every subject
on each item must be computed. As testees completed only a
small fraction of the total number of items in the pyramidal
tests, estimates of the probable scores on unattempted items
were made according to the procedures of the all-item
scoring method described above. The Spearman-Brown formula
was used to equate the number of items between the con-
ventional and pyramidal tests since the pyramidal test using
the "all-item" score had three times the number of items as
the conventional test. Hansen (1969) employed a similar
method for obtaining the KR-20 reliability indices for a
number of four-stage pyramidal tests.
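
Both computations can be sketched briefly. Hoyt's method estimates
reliability from a persons-by-items analysis of variance, and the
Spearman-Brown formula projects a reliability to a test of a different
length. The function names and the small score matrix are hypothetical,
and applying a length ratio of 1/3 (120 items reduced to 40) is an
assumption about how the two tests were equated.

```python
def hoyt_reliability(scores):
    """Hoyt analysis-of-variance reliability.

    scores[p][i] is the score of person p on item i (complete matrix).
    """
    n_p, n_i = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_p * n_i)
    person_means = [sum(row) / n_i for row in scores]
    item_means = [sum(row[i] for row in scores) / n_p for i in range(n_i)]
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_persons = n_i * sum((m - grand) ** 2 for m in person_means)
    ss_items = n_p * sum((m - grand) ** 2 for m in item_means)
    ms_persons = ss_persons / (n_p - 1)
    ms_residual = (ss_total - ss_persons - ss_items) / ((n_p - 1) * (n_i - 1))
    return 1 - ms_residual / ms_persons


def spearman_brown(r, length_ratio):
    """Projected reliability when test length is multiplied by length_ratio."""
    return length_ratio * r / (1 + (length_ratio - 1) * r)


# Hypothetical 4-person, 3-item matrix of right/wrong scores.
r_full = hoyt_reliability([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]])
r_third = spearman_brown(r_full, 1 / 3)   # same test at one-third the length
```
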

Relationships among scoring methods. To determine which
pyramidal scoring methods were most similar and which was
most highly related to conventional test scores, each score
was correlated with every other score. Both the product
moment correlation and the correlation ratio were computed
for each pair of scores.



Results 

Order Effects

Table 3 provides results of the analyses of the effects
of order of administration on scores for pyramidal and con-
ventional tests. Means and standard deviations for both
groups completing each test first or second in a paired
administration are given for each method of scoring the
pyramidal tests and for the conventional test. Of IV
t-tests, only one of the t-ratios for the difference between
the mean scores for each order was statistically signifi-
cant at the .05 level. There was, however, a trend showing
that when any one of the three pyramidal tests was administered
first, subjects tended to make slightly higher mean scores
than those who attempted that test second. For the con-
ventional test, mean score differences were also not statis-
tically significant, but the slight difference in means was
in the opposite direction. Since order did not appreciably
affect scores on the pyramidal or conventional tests, all
subsequent analyses combined the data from the two order
groups.

Score Distributions 

Pyramidal tests. Table 4 shows descriptive statistics
for the first administration of pyramid 1 and the conven-
tional test. Similar data are shown for pyramids 2 and 3
and for the retests of all pyramids and the conventional
test, in Appendix D. Mean scores shown in Table 4 for
both tests indicated that, on the average, the testees
answered approximately half (7.90) of the fifteen items in
the pyramid correctly, suggesting that the difficulty of
the test was appropriate for the ability of the subjects
tested. This result was also found for the pyramid 1 retest
(Appendix Table D-1), the pyramid 2 administration (Table
D-2) and both administrations of pyramid 3 (Tables D-3 and
D-4). As might be expected, the mean difficulty score for
all items attempted (.0-0 was higher than the mean diffi-
culty score for all items answered correctly { - . I .V ), indi-
cating that testees usually responded incorrectly to those
items which were above their ability level.

Standard deviations for all methods of scoring the
first administration of pyramid 1 are given in Table 4.
Because the scoring methods used for the pyramidal tests
were all on different scales, the variabilities associated
with each method are not directly comparable. Thus, Table
4 shows the standard deviations expressed as a proportion
of each scoring method's potential range. Inspection of
Table 4 (and the supplementary data in Appendix D) indicates
that two pairs of scoring methods provided almost identical
values for all pyramidal tests. The number correct score
and the (n + 1)th scoring method both used from 16 to 19 percent
of the possible range. The mean difficulty of all items
attempted and the all-item scoring methods used from 1^;
to 23 percent of the possible range. Expressed as relative
variabilities, the mean difficulty of all items attempted
and the all-item score had the highest variabilities of the
pyramidal scoring methods (.22 and .21 in Table 4). The
mean difficulty of all items correct scoring method was
lowest in relative variability for pyramid 1. This finding
was consistent across all pyramids and all administrations
(see Appendix D). Thus, the mean difficulty of all items
attempted and the all-item score seem to provide the greatest
potential for inter-individual discrimination.

For five of the scoring methods used in the pyramid 1
study, score distributions tended to be positively skewed
but not significantly so. Only the mean difficulty of all
items correctly answered had a slightly negatively skewed
distribution (see Table 4). Both trends were also observed
for the retest of pyramid 1 (Table D-1) and for pyramid 2
(Table D-2). All score distributions for pyramid 3 were
positively skewed (Tables D-3 and D-4) both on initial test
and retest. However, for pyramid 3, using several of the
scoring methods, the degree of skewness indicated a statis- 
tically significant departure from normality. 

Distributions of scores for four scoring methods for
pyramid 1 were highly platykurtic, as shown in Table 4.
However, only two scoring method distributions remained
significantly platykurtic on retesting (Table D-1). The
all-item method of scoring and the mean difficulty of all
items attempted method consistently yielded the flattest
distributions. This finding is in accord with the finding
of greater relative variability for these methods of scoring
the pyramidal test. Results obtained for the pyramid 2
administration (Table D-2) were similar to those for the
pyramid 1 retest, with all scoring methods producing
platykurtic distributions and the same two scoring methods
showing significant departures from normal kurtosis.

For pyramid 3 the tendency for flat distributions was
still present but to a lesser degree (Tables D-3 and D-4).
Only the all-item scoring method for the initial admini-
stration was significantly platykurtic.

Conventional test. As Table 4 shows, the mean score
for the first administration of the conventional test was
22.73. Since the test was composed of 40 items and guessing
was possible, this mean score was appropriate, indicating
that the test was peaked at the difficulty level of the
group being tested. Retest scores (Table D-2) had a mean
of :>3.^o.

The variability of scores on the conventional test, 
expressed as the proportion of range index, was similar 
to that of the better pyramidal scoring methods. On 
retest (Table D-2) the two best pyramidal scores utilized
more of their potential range (.23) than did the conven-
tional test (.21). Further, there was a slight, but non-
significant, tendency for scores on both administrations
of the conventional test to be positively skewed. The
score distribution for the conventional test was highly
platykurtic for the first administration, indicating a
statistically significant difference from normality. The
distribution remained platykurtic on retesting but was not
significantly different from a normal distribution.

Test-Retest Stability

Pyramidal tests. The stability data for the pyramidal
tests in Table 5 permit a comparison of the relative stabili-
ties of the various methods for scoring pyramidal tests.
For the pyramid 1/pyramid 2 data, three scoring methods
yielded substantially lower stabilities. These methods
were number correct, difficulty of the (n + 1)th item, and
difficulty of final item. This pattern of results was
also observed for the pyramid 3 retest and the pyramid 1
retest, using the eta coefficients. It is interesting to
note that two of these least reliable scoring methods were
among those used by Lord (1970, 1971b) in his theoretical
studies of pyramidal tests. The most stable methods
for scoring the pyramids were the all-item score and the
mean difficulty of all items attempted score. Based on the
test-retest eta coefficients, mean difficulty of all items
correct was consistently the third most stable scoring
method but was substantially lower than the other two in the
pyramid 1 retest analysis.

In general, pyramid 1 was the least stable of the pyramids,
yielding results substantially below the retest using the
corrected pyramid. This was probably due to the errors
in the construction of pyramid 1 which introduced error into
the scores on both testings. Pyramid 3 was slightly less
stable than the pyramid 1/pyramid 2 administration. The
differences might be attributable to the differences in
construction of pyramid 3, or to characteristics of the
subjects.

Conventional test. Table 5 also shows the test-retest
reliability coefficients for the 40-item conventional test
based on the same 103 subjects who completed pyramids 1 and
2. The stability for the conventional test was r=.92 (eta=.93).
These were higher than any of the corresponding stabilities
for the 15-stage pyramidal tests. However, a comparison of
the eta coefficients for the two testing strategies shows
that the pyramidal test, composed of only 37.5% of the number
of items in the conventional test, was able to achieve stability
coefficients not significantly different from those of the
conventional test. Both the all-item score and the mean
difficulty of all items attempted score yielded test-retest
eta coefficients of .92, and the mean difficulty of all items
correct score achieved an eta stability of .91. These com-
pared favorably to the 40-item conventional test stability 
of .93 for the same subjects. It should also be pointed out 
that the pyramidal data were based on a modified pyramid at 
retest (pyramid 2) making the stability correlations not pure 
test-retest correlations for the pyramidal tests. 
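
The eta coefficients reported alongside each r can be illustrated with a minimal Python sketch of a correlation ratio. It assumes that eta is obtained by grouping examinees on their distinct first-test scores and comparing between-group with total variation in retest scores; the grouping procedure actually used is not stated here, and the function name and data are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def eta(test_scores, retest_scores):
    """Correlation ratio of retest scores on (grouped) first-test scores."""
    # Assumption: each distinct first-test score forms one group.
    groups = defaultdict(list)
    for x, y in zip(test_scores, retest_scores):
        groups[x].append(y)

    grand_mean = sum(retest_scores) / len(retest_scores)
    ss_total = sum((y - grand_mean) ** 2 for y in retest_scores)
    ss_between = sum(
        len(ys) * (sum(ys) / len(ys) - grand_mean) ** 2
        for ys in groups.values()
    )
    return sqrt(ss_between / ss_total)


# Hypothetical scores, not data from the study.
first = [1, 1, 2, 2, 3, 3]
second = [1, 3, 4, 6, 2, 4]
eta_value = eta(first, second)
```

Because eta does not assume a linear relationship, it is never smaller than the absolute value of r computed on the same grouped data, which is why the two are reported side by side.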

Stability comparison. A valid comparison of the relative
stabilities of the conventional and pyramidal tests was based
on the analysis of memory effects for conventional tests of
length equal to that of the pyramidal tests. The memory
analysis was based on the assumption that subjects completing
the 15-stage pyramidal tests on both test and retest would
not attempt the same 15 items on each administration. For
the 101 examinees completing pyramid 1 both times, the mean
number of items in common was 8.17 with a standard deviation
of 3.67. Only five subjects followed the same pathways
through the pyramid on both administrations (i.e., answered
the same fifteen items both times). The mean number of items
in common for the 103 testees in the pyramid 1/pyramid 2
group was 8.25; the standard deviation was 3.?7, and three
subjects used the same pathways on both administrations.
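
The items-in-common count can be illustrated with a short Python sketch. It assumes the usual pyramid layout, in which stage k holds k items and the item reached at each stage is fixed by the number of correct answers so far (up-one/down-one branching). The function names and response patterns are hypothetical and are not data from the study.

```python
from typing import List, Tuple

Item = Tuple[int, int]  # (stage, position within that stage)


def pathway(responses: List[bool]) -> List[Item]:
    """Items administered, given right/wrong responses at each stage."""
    position = 0  # number of correct answers so far
    items = []
    for stage, correct in enumerate(responses, start=1):
        items.append((stage, position))
        if correct:
            position += 1  # branch toward the more difficult item
    return items


def items_in_common(first: List[bool], second: List[bool]) -> int:
    """Number of items seen on both administrations."""
    return len(set(pathway(first)) & set(pathway(second)))


# Two hypothetical 15-stage administrations for one examinee.
test_responses = [True, False] * 7 + [True]
retest_responses = [True, True, False] * 5
shared = items_in_common(test_responses, retest_responses)
```

Under these assumptions every examinee shares at least the first-stage item, and an examinee shares all 15 items only when the response pattern is identical through the first 14 stages.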

The test-retest correlations for both 15-item parallel
conventional subtests and the correlation of one form with
the other across time are presented in Table 5 and summarized
in Figure 3. These data serve as a basis for comparison of
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Figure 3 



Test-retest stability, parallel forms reliabilities and
parallel forms stabilities for two 15-item parallel conventional
tests (N=103)

Time 1, Subtest 1 to Time 2, Subtest 1:  r = .88 (eta = .89)
Time 1, Subtest 2 to Time 2, Subtest 2:  r = .85 (eta = .89)*

*Curvilinearity statistically significant at p = .0?5.
the 15-stage pyramid and 15-item conventional tests. It
can be seen that when the same items were presented at
both test and retest, scores were more highly correlated
(r=.88 and .85) than when scores on one form were compared
with scores on the other during retest (r=.78 and .71) or
when scores on different parallel forms were correlated
at the same administration (r=.75 and .79). These results
are in accordance with the results predicted above and
therefore are consistent with an hypothesis that memory
effects were operating in the conventional test to inflate
test-retest reliability coefficients. Thus, if the pyramidal
tests (with an average of only about half the number
of items repeated in comparison to the conventional tests
of equal length) had stabilities equal to those of the
conventional tests, their stability coefficients should lie
between the "no-memory" results and the "maximum memory"
results.

The data in Table 5 show that three methods of scoring
the pyramidal test (the two mean difficulty scores and the
all-item score) yielded stability coefficients which were
comparable to those of conventional subtest 1 and greater
than those of conventional subtest 2 (i.e., maximum memory
effects). All pyramidal scoring methods showed higher
stabilities than the "across forms" correlations of the
parallel conventional tests (no memory effects). Thus,
when the effects of memory are taken into account, the
pyramidal testing strategy shows greater stability than a
conventional test of the same length.

A comparison of the eta coefficients in Table 5
supports the conclusion that the pyramidal test yields
more stable scores than the conventional test. Three
methods of scoring the pyramidal tests yielded eta
stabilities (.92, .91, .92) higher than those of either
of the two conventional subtests (.89). This finding is
especially significant in that the conventional subtests
allowed the possibility of maximum memory effects while
the pyramidal test permitted an average of only half the
potential for memory effects to operate.

Since the pyramid 1 and pyramid 3 retests used different
subjects than the retests of the conventional tests, a
direct comparison is not completely appropriate. However,
it is interesting to note that even under those circumstances,
the best methods of scoring the pyramidal tests
yielded eta stabilities equal to or greater than those for
the conventional tests with maximum memory effects.
The finding that the stability analysis of the two
conventional subtests followed a pattern consistent with
the hypothesis of memory effects inflating test-retest
reliability coefficients also suggests that the stability
coefficients for the 40-item test are inflated by memory
effects. From this perspective, the retest eta coefficient
of .93 for the 40-item conventional test (with
40 items repeated at retest for all testees) compares
very unfavorably with the retest eta of .92 for the retest
of the pyramidal test (with an average of 8.25 items
repeated).

Retest interval . Table 6 presents the test-retest 
correlations for the conventional and pyramidal tests as 
a function of the time interval between administrations. 
In general, there was little systematic variation in 
stability with respect to time interval for either the 
pyramidal or conventional tests. When subjects completed 
pyramid 1 at time 1 and pyramid 2 at time 2, the medium 
time interval showed the greatest stability. For pyramid 1,
the short and medium time intervals showed similar stabilities
while the long time interval had higher correlations under
each method of scoring. For pyramid 3 the highest test- 
retest correlations were obtained for the short and long 
time intervals. No general trend is apparent for the 40-item 
conventional test. 

As shown in Table 6 the test-retest correlations for 
both 15-item conventional subtests were higher than those 
for the pyramidal tests for the short and long time inter- 
vals. For the medium time interval the two mean difficulty 
scoring methods and the all-item scoring method showed 
higher stabilities than either of the shortened conventional 
tests. All pyramidal scoring methods were more stable than 
conventional subtest 2 for the medium time interval. 

Change analysis. Correlated t-ratios comparing mean
scores obtained on both administrations of the tests are
presented in Table 7. None of the pyramid 1 change scores
were significant at p = .05, and only the mean difficulty
scoring methods showed significant increases when mean
scores for pyramid 1 (time 1) and pyramid 2 (time 2) were
compared. The latter result is most likely due to the
modifications made in pyramid 2 to correct the two items
of inappropriate difficulty found in pyramid 1, since
the mean difficulty scores would be most affected by this
change.
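
A correlated (paired) t-ratio of the kind used in the change analysis can be sketched as follows. This is a minimal illustration of the standard formula, the mean test-to-retest difference divided by its standard error; the function name and the scores are hypothetical and are not taken from the study.

```python
from math import sqrt
from statistics import mean, stdev


def correlated_t(test, retest):
    """Paired t-ratio for the change in mean score from test to retest."""
    diffs = [after - before for before, after in zip(test, retest)]
    # Standard error of the mean difference (sample standard deviation).
    standard_error = stdev(diffs) / sqrt(len(diffs))
    return mean(diffs) / standard_error


# Hypothetical scores for five examinees; degrees of freedom = n - 1 = 4.
time_1 = [7, 9, 8, 10, 6]
time_2 = [8, 9, 10, 11, 8]
t_ratio = correlated_t(time_1, time_2)
```

A positive ratio indicates higher mean scores at retest; its significance is judged against the t distribution with n - 1 degrees of freedom.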
For pyramid 3, the probabilities associated with four
of the six scoring methods were less than .05. Thus, the
increases from time 1 to time 2 were statistically significant,
in one case at the .001 level. On the conventional
test, scores were higher on the retest and the difference
was significant at the .01 level.

The significant increases in test scores between test 
and retest seen in the results for pyramid 3 contrast sharply
with the nonsignificant increases for the retest of pyramid
1 and with those for pyramid 1/pyramid 2 retest. The time
interval between test and retest was approximately the same
for all administrations, so it can probably be ruled out as 
a cause of this discrepancy. It is possible that charac- 
teristics of the subject groups contributed to the differ- 
ence. 

Also, differences in the construction of the pyramidal
tests and/or differences in administration could have
caused the significant mean differences for pyramid 3.
As was indicated earlier, pyramid 3 was constructed using
all available items in the item pool regardless of whether
they were to be administered under another adaptive strategy,
whereas in constructing pyramids 1 and 2 item overlap was
avoided. As Table 1 shows, pyramid 3 was first administered
with a stradaptive test which had a considerable degree of
item overlap with the pyramidal test. As a result, testees
would likely be administered a substantial number of common
items on first administration. This might result in a
greater memory effect on retest than when the testees
answered each item only once on first administration, as
they did in pyramids 1 and 2.

The very significant increase in mean scores upon re- 
testing for the conventional test is likely to be a func- 
tion of memory and/or practice effects, in comparison to 
the general absence of such effects for the corrected 
pyramidal retest (pyramid 1/pyramid 2) for the same group
of subjects. These results support the memory analyses
reported above suggesting that scores on pyramidal tests
are less affected by memory than those of conventional tests.

Internal Consistency Reliability

The Hoyt (1941) index of internal consistency reliability
for the 40-item conventional test was [illegible] for the
initial administration and [illegible] for the retest. When the
Spearman-Brown correction for triple length was used (in
order to make the conventional test comparable to a pyramidal
test of 120 items) the reliability increased to [illegible]
for both test and retest. For every administration of the
pyramidal test using the all-item score this index was
.99. It would appear, then, that the all-item method
of scoring (Hansen, 1969) yields a reliability coefficient
which is spuriously high. Such a result may be
due to the strong assumptions made about the monotonic
relationships of item difficulty and testee response in
computing scores under this method. Under this scoring
method, error does not affect the items a person does not
attempt.
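
The Spearman-Brown correction for a test lengthened by a factor k is r_k = k r / (1 + (k - 1) r). The sketch below applies it for triple length, matching the 40-item to 120-item comparison described above. The starting reliability of .80 is a hypothetical value chosen only for illustration and is not a figure from the study.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Projected reliability when test length is multiplied by length_factor."""
    return (length_factor * reliability) / (
        1 + (length_factor - 1) * reliability
    )


# Hypothetical 40-item reliability of .80, projected to 120 items (k = 3).
projected = spearman_brown(0.80, 3)  # 2.4 / 2.6, roughly .92
```

The correction assumes that the added items are parallel to the existing ones, so the projected value is an idealized figure.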

Relationships among Scoring Methods 

Table 8 presents the intercorrelation matrix for all
pyramidal scoring methods for pyramid 1 and the correlations
between pyramidal scoring methods and the conventional test
scores. Similar data for the other pyramids are shown in
Appendix E.

Pyramidal vs. conventional scores. For pyramid 1 the
all-item score correlated more highly (r=.86) with scores
on the conventional test than any other scoring method. The
mean difficulty of all items attempted scoring method
correlated nearly as highly (r=.85) with scores on the
conventional test as the all-item score. The same two scoring
methods were most highly correlated with the conventional
test when pyramid 2 was used (Appendix Table E-1). For both
pyramids 1 and 2 the number correct method as well as the
n+1th item scoring method correlated lower with the conventional
test than did the other methods.

Methods of pyramidal scoring. For all test administrations
(Table 8 and Appendix E) the highest values obtained
in the intercorrelation matrices were those between the
number correct and difficulty of the n+1th item scoring
methods. Such a correlation should always equal 1.0 as the
16 possible scores for the number correct method (0 through
15) correspond exactly to the scores of the 16 n+1th
difficulties, no matter how such difficulties are computed.
Lord (1970) has also shown this to be the case. All testees
answering a given number of items correctly will be branched
to the same n+1th terminal position in the pyramidal structure,
regardless of which items were correct. The assumptions
needed are that the values for the n+1th scores increase
monotonically and that these items are equally spaced on
the difficulty continuum. In a properly constructed pyramid
this must be the case. However, due to the two item placement
errors in pyramid 1 the Pearson correlation between
these two scoring methods for pyramid 1 (Table 8) was only
.99. For the other pyramidal tests (Appendix E) this
correlation was 1.0, as would be expected.
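
The one-to-one correspondence between number correct and the n+1th terminal position can be checked by brute force. The sketch below assumes an up-one/down-one rule in which each correct answer moves the testee one step toward more difficult items and each error one step toward easier items. It enumerates every response pattern for a 15-stage pyramid; the function names are hypothetical.

```python
from itertools import product


def terminal_offset(responses):
    """Net position after all stages: +1 per correct answer, -1 per error."""
    offset = 0
    for correct in responses:
        offset += 1 if correct else -1
    return offset


def offsets_by_number_correct(n_stages):
    """Map each number-correct score to the terminal offsets it can reach."""
    reached = {}
    for pattern in product([False, True], repeat=n_stages):
        reached.setdefault(sum(pattern), set()).add(terminal_offset(pattern))
    return reached


# All 2**15 response patterns for a 15-stage pyramid.
reached = offsets_by_number_correct(15)
one_to_one = all(len(offsets) == 1 for offsets in reached.values())
```

Each of the 16 number-correct scores reaches exactly one terminal offset (2k - 15 for k correct), so when the terminal items are equally spaced in difficulty the two scores are linearly related and their Pearson correlation is 1.0.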
Another strikingly high correlation observed for all
administrations of the pyramids was that between the mean
difficulty of all items attempted score and the all-item
scores (r=.99 for pyramid 1). This high correlation
accounts for the fact that stability estimates for these
two scoring methods were nearly always equal; stabilities
of these scoring methods were always higher than those
for any other scoring method. Such a strong relationship
might not be expected as the all-item score appears
to have only a very approximate relationship with the actual
item difficulties. The lowest correlations among pyramidal
scoring methods involved the mean difficulty of all items
correctly answered score. This finding, in conjunction with
the comparatively low stabilities of this scoring method,
suggests that it is the least valuable pyramidal scoring
method. The mean difficulty of all items correctly answered
correlated more highly with the mean difficulty of all items
attempted than with any of the other scoring methods. This
was expected since both methods involve only simple averaging
of the difficulties of some or all of the 15 items administered
to an individual. Thus, the mean difficulty of all items
correct also correlated highly with the all-item score.
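
The two mean difficulty scores differ only in which of the administered items enter the average. A minimal sketch follows, assuming each administered item has a known difficulty value; the function names, difficulties, and responses are hypothetical.

```python
def mean_difficulty_attempted(difficulties):
    """Average difficulty of every item administered to the testee."""
    return sum(difficulties) / len(difficulties)


def mean_difficulty_correct(difficulties, responses):
    """Average difficulty of only those items answered correctly."""
    correct = [d for d, right in zip(difficulties, responses) if right]
    # Undefined when no item was answered correctly.
    return sum(correct) / len(correct) if correct else None


# Hypothetical four-item pathway (difficulty values are illustrative).
path_difficulties = [-0.5, 0.0, 0.5, 1.0]
path_responses = [True, True, False, True]
attempted = mean_difficulty_attempted(path_difficulties)
correct_only = mean_difficulty_correct(path_difficulties, path_responses)
```

Because the second score averages a subset of the items entering the first, the two are expected to be related, although the second discards information about the items missed.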

The difficulty of the final item scoring method correlated
highest with the n+1th item and total number correct
methods. Since for a certain final item, only two n+1th
scores are possible given the structure of the pyramid,
such scoring methods will be very highly related. However,
the correlations will not be 1.0, since some of the testees
answer the final item correctly while others do not.

The all-item scoring method correlated highly with more 
scoring methods than any other. This finding contrasts 
sharply with those of Hansen (1969). In that investigation
the all-item method had the lowest relationship to
the other scoring methods used. 



Discussion and Conclusions 

The order of test administration was not found to
significantly affect mean scores for either pyramidal or
conventional tests. The trend for pyramidal test scores to
be lower when the pyramid was administered after the 40-
item conventional test suggests that fatigue may have
affected the testees to some small extent. In a study of
two-stage tests, Betz and Weiss (1973) found order effects
to be non-significant.



The pyramidal tests used in this study were found to be
of appropriate difficulty for the ability of the testees.
This is shown by the fact that, for all administrations,
the mean number of items correctly answered was slightly
more than half of the total number of items administered.
Such results were not obtained by Seeley, Morton and
Anderson (1962) in their paper and pencil administration
of pyramidal tests. In that case a large percentage of
testees obtained the maximum score. This might have
been due to the easiness of their test or to the exclusion
of many test papers submitted by testees of lower
ability who had difficulty in following the branching
instructions. When Bayroff and Seeley (1967) administered
branched tests by computer, scores on a verbal item pyramid
were distributed approximately normally. Thus, it
appears that when a good estimate of the general ability
level of a group of individuals is known in advance, a pyramidal
test of appropriate difficulty can be constructed.

In contrast to the highly negative skew in the Seeley
et al. study, distributions for pyramids 1 and 2 were
approximately normal with a slightly positive skew for
most scoring methods. Only the average difficulty of all
items correctly answered score produced a negatively skewed
distribution, but again the distribution was approximately
normal. For pyramid 3, however, the departure from normality
was significant and in a positive direction. This result
was unexpected as pyramid 3 was slightly easier than the
others.

The trend for most of the pyramidal distributions to 
be platykurtic has been noted by Hansen (1969), who obtained
a rectangular score distribution. For pyramids 1 and 2 most 
of the score distributions were significantly flatter than 
the normal distribution while for pyramid 3 almost all 
were not. 

The conventional test used in the present study 
also yielded scores which were significantly platykurtic.
As Betz and Weiss (1973) have pointed out, this may have
been a function of deviations in the peakedness of the
conventional test, with a more highly peaked test producing
more nearly normal score distributions.

While Betz and Weiss (1973) have found that the two-
stage testing strategy yielded scores which utilize a
higher proportion of the score range than a conventional
test, the pyramidal tests in the present study used a
percentage of range equal to or slightly greater than
that of the conventional test for only two of the six
scoring methods used. These were the mean difficulty of
all items attempted and the all-item scores, which were
later shown to correlate .99.
A comparison of the scoring methods used for the pyramidal
tests indicated that the most stable were the mean
difficulty of all items attempted and the all-item scores.
Test-retest correlations for these scoring methods approached
those of the 40-item conventional test. This finding
supports Lord's (1970, 1971b) contention that the average
difficulty score is the most appropriate way to score a
pyramidal test when the up-one/down-one branching rule is
used. In each of the pyramids these scoring methods were
consistently more stable than either the difficulty of the
n+1th item scoring method, or the number correct score, and
they also correlated more highly with conventional test
scores than any other scoring methods. One possible explanation
for the good results obtained with these two methods
is that they utilize more information than the other scoring
methods and take account of the different pathways through
the test structure. As most of the earlier studies of
pyramidal testing (Bayroff, Thomas and Anderson, 1960;
Seeley, Morton and Anderson, 1962; Waters, 1964; Bayroff
and Seeley, 1967; and Waters, 1970) have used a simple
rank ordering of scores essentially equivalent to the
number correct score, or n+1th item difficulty score, the
correlations with parent tests obtained in these studies
might have been higher, had either of the better scoring
methods been used.

The time interval between test administrations did 
not affect the stabilities of either the pyramidal or 
conventional tests in any consistent manner. But the 
intervals used were restricted to between six and ten 
weeks. Longer time intervals would be appropriate to 
show more clearly whether pyramidal testing provides 
estimates of abilities which are more stable over time than
those of conventional testing. 

The analysis of memory effects in the present study
indicated that pyramidal testing provides estimates of
ability comparable to conventional tests of the same
length even though in the conventional test testees
attempt the same items at both test and retest, resulting
in an inflated estimate of stability due to memory of
previous responses. When the effects of memory were
controlled for, the pyramidal tests showed higher stabilities
than conventional tests with the same number of items.

The analysis of the change in mean scores from test
to retest indicated that scores on the conventional test
increased significantly. For the pyramidal strategy the
significance of increases in test scores depended on the
scoring method used and the particular pyramid involved.
For pyramid 1, none of the differences obtained under any
scoring method was significant. For pyramid 3, all the
differences obtained (except for the two mean difficulty
scores) were statistically significant. This result may
have been an artifact of the methods of construction and
administration of pyramid 3, in contrast to that of the
other pyramids. Significant increases in mean scores for
the pyramid 1/pyramid 2 administration were found using
the mean difficulty scores only. These results were likely
due to the errors in the construction of the branching
network of pyramid 1.

The internal consistency reliabilities for pyramidal
tests obtained by Hansen (1969) for several three- and
four-stage pyramids scored by the all-item method were
quite high. The present study also obtained extremely
high internal consistency reliability for this scoring
method. The all-item scoring method, however, makes a
strong assumption about the correctness of responses to
unattempted items based on actual responses. A correct
response to an item is taken as evidence that all easier
items in that stage will be answered correctly while almost
all more difficult items will be answered incorrectly.
Internal consistency reliabilities calculated from such
hypothetical response patterns would thus seem to be seriously
overestimated. At present, then, the internal consistency
reliabilities of adaptive tests would seem to be unmeasurable
by conventional methods which require a response to each item
by every individual. In one recent study of adaptive testing,
Betz and Weiss (1973) were able to measure internal consistency
reliability for two-stage tests only by considering
the routing and measurement tests as separate conventional
tests.

Comparison of the scoring methods used indicated three
important facts: (1) the mean difficulty of all items
attempted correlates very highly with the all-item score;
(2) the number correct and difficulty of the n+1th item
scores are also perfectly correlated given a properly
constructed pyramid, as has been shown by Lord (1970, 1971b);
(3) the all-item score correlates highly with more other
scoring methods than any other scoring method. Hansen
(1969) found that the all-item scoring method had the lowest
over-all relationship to three other scoring methods used.
This discrepancy may be due to the extremely short tests
used in Hansen's study or to the fact that two of Hansen's
other scoring methods were not used in the present study.
The major deficiency of the present study was the
presence of two errors in pyramid 1. These two items
of inappropriate difficulty may have served to increase
the mean scores of the pyramid as they were relatively
easy items located in positions designed for more difficult
items. Seeley, Morton and Anderson (1962) have
encountered similar difficulties with their sequential
item tests. In that study, "despite repeated checking
and cross-checking the ... tests administered in the field
showed a number of construction oversights which would
require correction before further use could be made of the
tests" (Seeley et al., 1962, p. 7). The effects of errors
in estimating the difficulties and discriminations of items
in pyramidal tests were investigated by Paterson (1962).
He found that errors in item difficulty were insignificant
when they occurred early in testing. This would seem to
indicate that the branching process serves to reduce the
effects of items of inappropriate difficulty. As the errors
in pyramid 1 were in the fourth and sixth stages, the effects
of the errors on the score distribution may have been
negligible. The results of the present study support Paterson's
finding, since the test-retest correlations of scores
on pyramid 1 and pyramid 2 were still higher than those of
equal length conventional tests when memory effects were
taken into account. That these results were obtained from
the administration of a pyramidal test with two errors in
item placement indicates that pyramidal adaptive tests with
errors in their construction will give results similar to
those of properly constructed pyramidal tests.

The findings of the present study suggest that pyramidal
testing can provide estimates of ability which have stabilities
comparable to those of longer conventional tests and
greater than those of conventional tests of the same length.
Further studies will be needed to determine whether pyramidal
testing provides more precise ability estimates throughout
the entire range of ability than those of conventional tests
and whether pyramidal tests correlate more highly with an
external criterion of ability than conventional testing methods.
Item Difficulty and Discrimination
Parameters for Items of the Three Pyramidal Tests 
and the Conventional Test 
Table A-4

Item Difficulty (b) and Discrimination (a) 
Parameters for the Conventional Test 



Item
Reference No.        b            a

     58           -.957     [illegible]
    221           -.740        .647
    307           -.836        .562
    386            .136        .697
    211           -.720        .609
    224           -.785        .543
    390           -.731        .627
    667           -.726        .568
    156           -.631        .647
    208           -.681        .582
    23?           -.687        .512
     52           -.282        .600
    137           -.739        .400
    176           -.897        .338
    207           -.526        .?02
    218           -.928        .332
    205           -.618        .472
    382           -.481        .638
    342            .172        .774
    265            .173        .772
[illegible]       -.320        .501
    661           -.296        .579
    670           -.282        .620
    327           -.248        .571
     50           -.234        .505
[illegible]       -.184        .627
    369           -.215        .562
    233           -.172        .468
    139            .189        .417
    633           -.078        .501
    146            .000        .607
    295           -.035     [illegible]
    113            .247        .609
    267            .188        .436
     59            .173        .637
    147           1.152        .383
    174           1.156        .638
    242            .979        .310
    306            .969        .490
    367            .978        .377
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Appendix B 

Possible Score Ranges for Three Pyramidal Tests 



Scoring Method                    Pyramid 1   Pyramid 2   Pyramid 3

Number correct                      15          15          15

Mean difficulty of all items
  attempted                         2.97        2.91        2.79

Mean difficulty of all items
  correct                           4.58        4.38        4.42

Difficulty of final item            5.81        5.81        5.?8

Difficulty of n+1th item            6.21        6.21        5.98

All-item score                      240         240         240
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Appendix C 

Difficulty (b) and Discrimination (a) Item Parameters 
for two 15-item Parallel Conventional Subtests 



           Subtest 1                            Subtest 2

Item                                 Item
reference no.     b       a          reference no.     b       a

    307         -.836    .562            224         -.785    .543
    386          .136    .697            390         -.731    .627
    211         -.720    .609            667         -.726    .568
    156         -.631    .647            176         -.897    .338
    208         -.681    .582            382         -.481    .638
     52         -.282    .600            342          .172    .774
    207         -.526    .?02            670         -.282    .620
    218         -.928    .332             50         -.234    .505
    265          .173    .772        [illegible]     -.184    .627
    661         -.296    .579            369         -.215    .562
    327         -.248    .571            295         -.035  [illegible]
    233         -.172    .468            267          .188    .436
    139          .189    .417             59          .173    .637
    633         -.078    .501            242          .979    .310
    367          .978    .377        [illegible]

Mean            -.253    .554                        -.202  [illegible]
s.d.             .505    .124                         .500    .118
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Appendix D 



Descriptive Statistics for Score Distributions 
of Pyramidal and Conventional Tests 
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